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VARIABLE SELECTION IN LINEAR MIXED EFFECTS MODELS 

^ \ By Yingying Fan 1 and Runze Li 2 

■ University of Southern California and Pennsylvania State University 

^ 1 This paper is concerned with the selection and estimation of fixed 

O . an d random effects in linear mixed effects models. We propose a class 

■ of nonconcave penalized profile likelihood methods for selecting and 

estimating important fixed effects. To overcome the difficulty of un- 
known covariance matrix of random effects, we propose to use a proxy 
matrix in the penalized profile likelihood. We establish conditions on 

| the choice of the proxy matrix and show that the proposed procedure 

. enjoys the model selection consistency where the number of fixed ef- 

fects is allowed to grow exponentially with the sample size. We further 
propose a group variable selection strategy to simultaneously select 

■ and estimate important random effects, where the unknown covari- 
ance matrix of random effects is replaced with a proxy matrix. We 
prove that, with the proxy matrix appropriately chosen, the proposed 
procedure can identify all true random effects with asymptotic prob- 
ability one, where the dimension of random effects vector is allowed 
to increase exponentially with the sample size. Monte Carlo simula- 
tion studies are conducted to examine the finite-sample performance 
of the proposed procedures. We further illustrate the proposed pro- 
cedures via a real data example. 
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1. Introduction. During the last two decades, linear mixed effects mod- 
els [Laird and Ware (1982), Longford (1993)] have been widely used to model 
longitudinal and repeated measurements data, and have received much at- 
tention in the fields of agriculture, biology, economics, medicine and sociol- 
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ogy; see Verbeke and Molenberghs (2000) and references therein. With the 
advent of modern technology, many variables can be easily collected in a 
scientific study, and it is typical to include many of them in the full model 
at the initial stage of modeling to reduce model approximation error. Due to 
the complexity of the mixed effects models, the inferences and interpretation 
of estimated models become challenging as the dimension of fixed or ran- 
dom components increases. Thus the selection of important fixed or random 
components becomes a fundamental problem in the analysis of longitudinal 
or repeated measurements data using mixed effects models. 

Variable selection for mixed effects models has become an active research 
topic in the literature. Lin (1997) considers testing a hypothesis on the vari- 
ance component. The testing procedures can be used to detect whether an 
individual random component is significant or not. Based on these testing 
procedures, a stepwise procedure can be constructed for selecting important 
random effects. Vaida and Blanchard (2005) propose the conditional AIC, an 
extension of the AIC [Akaike (1973)], for mixed effects models with detailed 
discussion on how to define degrees of freedom in the presence of random 
effects. The conditional AIC has further been discussed in Liang, Wu and 
Zou (2008). Chen and Dunson (2003) develop a Bayesian variable selection 
procedure for selecting important random effects in the linear mixed effects 
model using the Cholesky decomposition of the covariance matrix of random 
effects, and specify a prior distribution on the standard deviation of random 
effects with a positive mass at zero to achieve the sparsity of random com- 
ponents. Pu and Niu (2006) extend the generalized information criterion 
to select linear mixed effects models and study the asymptotic behavior of 
the proposed method for selecting fixed effects. Bondell, Krishna and Ghosh 
(2010) propose a joint variable selection method for fixed and random effects 
in the linear mixed effects model using a modified Cholesky decomposition in 
the setting of fixed dimensionality for both fixed effects and random effects. 
Ibrahim et al. (2011) propose to select fixed and random effects in a general 
class of mixed effects models with fixed dimensions of both fixed and random 
effects using maximum penalized likelihood method with the SCAD penalty 
and the adaptive least absolute shrinkage and selection operator penalty. 

In this paper, we develop a class of variable selection procedures for both 
fixed effects and random effects in linear mixed effects models by incor- 
porating the recent advances in variable selection. We propose to use the 
regularization methods to select and estimate fixed and random effects. 
As advocated by Fan and Li (2001), regularization methods can avoid the 
stochastic error of variable selection in stepwise procedures, and can signif- 
icantly reduce computational cost compared with the best subset selection 
and Bayesian procedures. Our proposal differs from the existing ones in the 
literature mainly in two aspects. First, we consider the high-dimensional set- 
ting and allow dimension of fixed or random effects to grow exponentially 
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with the sample size. Second, our proposed procedures can estimate the fixed 
effects vector without estimating the random effects vector and vice versa. 

We first propose a class of variable selection methods for the fixed effects 
using penalized profile likelihood method. To overcome the difficulty of un- 
known covariance matrix of random effects, we propose to replace it with a 
suitably chosen proxy matrix. The penalized profile likelihood is equivalent 
to a penalized quadratic loss function of the fixed effects. Thus, the proposed 
approach can take advantage of the recent developments in the computa- 
tion of the penalized least-squares methods [Efron et al. (2004), Zou and Li 
(2008)]. The optimization of the penalized likelihood can be solved by the 
LARS algorithm without extra effort. We further systematically study the 
sampling properties of the resulting estimate of fixed effects. We establish 
conditions on the proxy matrix and show that the resulting estimate enjoys 
model selection oracle property under such conditions. In our theoretical in- 
vestigation, the number of fixed effects is allowed to grow exponentially with 
the total sample size, provided that the covariance matrix of random effects 
is nonsingular. In the case of singular covariance matrix for random effects, 
one can use our proposed method in Section 3 to first select important ran- 
dom effects and then conduct variable selection for fixed effects. In this case, 
the number of fixed effects needs to be smaller than the total sample size. 

Since the random effects vector is random, our main interest is in the 
selection of true random effects. Observe that if a random effect covariate 
is a noise variable, then the corresponding realizations of this random ef- 
fect should all be zero, and thus the random effects vector is sparse. So we 
propose to first estimate the realization of random effects vector using a 
group regularization method and then identify the important ones based on 
the estimated random effects vector. More specifically, under the Bayesian 
framework, we show that the restricted posterior distribution of the random 
effects vector is independent of the fixed effects coefficient vector. Thus, we 
propose a random effect selection procedure via penalizing the restricted 
posterior mode. The proposed procedure reduces the impact of error caused 
by the fixed effects selection and estimation. The unknown covariance matrix 
is replaced with a suitably chosen proxy matrix. In the proposed procedure, 
random effects selection is carried out with group variable selection tech- 
niques [Yuan and Lin (2006)]. The optimization of the penalized restricted 
posterior mode is equivalent to the minimization of the penalized quadratic 
function of random effects. In particular, the form of the penalized quadratic 
function is similar to that in the adaptive elastic net [Zou and Hastie (2005), 
Zou and Zhang (2009)], which allows us to minimize the penalized quadratic 
function using existing algorithms. We further study the theoretical proper- 
ties of the proposed procedure and establish conditions on the proxy matrix 
for ensuring the model selection consistency of the resulting estimate. We 
show that, with probability tending to one, the proposed procedure can se- 
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lect all true random effects. In our theoretical study, the dimensionality of 
random effects vector is allowed to grow exponentially with the sample size 
as long as the number of fixed effects is less than the total sample size. 

The rest of this paper is organized as follows. Section 2 introduces the 
penalized profile likelihood method for the estimation of fixed effects and 
establishes its oracle property. We consider the estimation of random effects 
and prove the model selection consistency of the resulting estimator in Sec- 
tion 3. Section 4 provides two simulation studies and a real data example. 
Some discussion is given in Section 5. All proofs are presented in Section 6. 

2. Penalized profile likelihood for fixed effects. Suppose that we have a 
sample of N subjects. For the ith subject, we collect the response variable 
Uij, the d x 1 covariate vector Xj,- and qx 1 covariate vector z^, for j = 
l,...,n^, where rij is the number of observations on the iih. subject. Let 
n = J2i=i n i' m n = rnaxi<j<Ar nj, and m n = mini<i<jv n%- We consider the 
case where limsup n 2p < oo, that is, the sample sizes for N subjects are 
balanced. For succinct presentation, we use matrix notation and write Yi = 
(yn,yi2, ■ ■■,yin t ) T , Xj = (xa,x. i2 , . . .,x in J T and Z» = (zii,Zj 2 ,. • .,z irii ) T . In 
linear mixed effects models, the vector of repeated measurements yj on the 
ith subject is assumed to follow the linear regression model 



where (3 is the rfx 1 population-specific fixed effects coefficient vector, 7^ 
represents the qx 1 subject-specific random effects with -y i ~ N(0,G), £i 
is the random error vector with components independent and identically 
distributed as iV(0,<7 2 ), and 7 1; . . . ,~f N ,£\, . . . ,en are independent. Here, G 
is the covariance matrix of random effects and may be different from the 
identity matrix. So the random effects can be correlated with each other. 

Let vectors y, 7 and e, and matrix X be obtained by stacking vectors 
yi, 7j and £j, and matrices Xj, respectively, underneath each other, and let 
Z = diag{Zi, . . . , Zat} and Q = diag{G, . . . , G} be block diagonal matrices. 
We further standardize the design matrix X such that each column has norm 
1/n. The linear mixed effects model (1) can be rewritten as 



2.1. Selection of important fixed effects. In this subsection, we assume 
that there are no noise random effects, and Q is positive definite. In the case 
where noise random effects exist, one can use the method in Section 3 to 
select the true ones. The joint density of y and 7 is 




Yi = Xj/3 + Zi7i + £j 



ii 



(2) 



y = X/3 + Z 7 + e. 



(3) 
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Given (3, the maximum likelihood estimate (MLE) for 7 is 7(/3) = B 2 (y — 
X/3), where B z = (Z T Z + a 2 g- 1 )- 1 Z T . Plugging 7(/3) into /(y,7) and drop- 
ping the constant term yield the following profile likelihood function: 

(4) MA7(/3)) = ex p{"^2 (y " x /3) Tp ^(y - X/3)}, 

where P z = (I - ZB^) T (I - ZB Z ) + a 2 B^Q^ 1 B z with I being the iden- 
tity matrix. By Lemma 3 in Section 6, P z can be rewritten as P z = (I + 
<7 _2 ZC/Z T ) _1 . To select the important x-variables, we propose to maximize 
the following penalized profile log-likelihood function: 

(5) \og{L n {M(P)))-nY,P^mi 

3=1 

where p\ n (x) is a penalty function with regularization parameter A n > 0. 
Here, the number of fixed effects d n may increase with sample size n. 
Maximizing (5) is equivalent to minimizing 

(6) QM = -(y - X/3) T P z (y - X/3) + n^> A „ (|/3,- |). 

3=1 

Since P z depends on the unknown covariance matrix Q and a 2 , we propose 
to use a proxy P z = (I + Z.MZ T ) _1 to replace P z , where Ai is a pre-specified 
matrix. Denote by Q n ((3) the corresponding objective function when P 2 is 
used. We will discuss in the next section how to choose A4. 

We note that (6) does not depend on the inverse of Q. So although we 
started this section with the nonsingularity assumption of G, in practice our 
method can be directly applied even when noise random effects exist, as will 
be illustrated in simulation studies of Section 4. 

Many authors have studied the selection of the penalty function to achieve 
the purpose of variable selection for the linear regression model. Tibshirani 
(1996) proposes the Lasso method by the use of L\ penalty. Fan and Li 
(2001) advocate the use of nonconvex penalties. In particular, they suggest 
the use of the SCAD penalty. Zou (2006) proposes the adaptive Lasso by 
using adaptive L\ penalty, Zhang (2010) proposes the minimax concave 
penalty (MCP), Liu and Wu (2007) propose to linearly combine Lq and 
L\ penalties and Lv and Fan (2009) introduce a unified approach to sparse 
recovery and model selection using general concave penalties. In this paper, 
we use concave penalty function for variable selection. 

Condition 1 . For each A > 0, the penalty function p\(t) with t 6 [0, oo) 
is increasing and concave with pa(0) = 0, its second order derivative exists 
and is continuous and p' x (0+) G (0,oo). Further, assume that sup t>0 p'^(t) — > 
as A -> 0. 
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Condition 1 is commonly assumed in studying regularization methods 
with concave penalties. Similar conditions can be found in Fan and Li (2001), 
Fan and Peng (2004) and Lv and Fan (2009). Although it is assumed that 
p'iit) exists and is continuous, it can be relaxed to the case where only p' x {t) 
exists and is continuous. All theoretical results presented in later sections 
can be generalized by imposing conditions on the local concavity of p\(t), 
as in Lv and Fan (2009). 

2.2. Model selection consistency. Although the proxy matrix P z may be 
different from the true one P z , solving the regularization problem (6) may 
still yield correct model selection results at the cost of some additional bias. 
We next establish conditions on P 2 to ensure the model selection oracle 
property of the proposed method. 

Let /3 be the true coefficient vector. Suppose that /3 is sparse, and denote 
s\ n = Wflo II 0; that is, the number of nonzero elements in j3 Q . Write 

where /3 10 is an si n -vector and /3 2 q is a (d n — si n )-vector. Without loss 
of generality, we assume that (3 2 q = 0, that is, the nonzero elements of 
(3 locate at the first s\ n coordinates. With a slight abuse of notation, we 
write X= (Xi,X2) with Xi being a submatrix formed by the first s± n 
columns of X and X2 being formed by the remaining columns. For a matrix 
B, let A m ; n (B) and A max (B) be its minimum and maximum eigenvalues, 
respectively. We will need the following assumptions. 

Condition 2. (A) Let a n = mini<j< Sln \/3o,j\- It holds that 

a n n T (log n) _3//2 — > 00 
with r € (0, 5) being some positive constant, and sup t>an / 2 p'{ n {t) = o(n~ 1+2r ) 

(B) There exists a constant c\ > such that A m j n (ciA4 - a~ 2 g) > and 
A min ( Cl a- 2 (logn)g-Ai)>0. 

(C) The minimum and maximum eigenvalues of matrices n~ 1 (X^Xi) 
and n e (X^P z Xi) _1 are both bounded from below and above by cq and 
Cq 1 , respectively, where 9 6 (2r, 1] and cq > is a constant. Further, it holds 
that 

<n^(logn) 3 / 4 M>n/2), 



(7) 



1 T ~ 
-XfP^Xa 

n 



(8) HXlXx^XfP^r 1 ^ <p' Xn (0+)/p' Xn (a n /2), 

where || • ||oo denotes the matrix infinity norm. 

Condition 2(A) is on the minimum signal strength a n . We allow the min- 
imum signal strength to decay with sample size n. When concave penalties 
such as SCAD [Fan and Li (2001)] or SICA [Lv and Fan (2009)] are used, 
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this condition can be easily satisfied with A n appropriately chosen. Condi- 
tions 2(B) and (C) put constraints on the proxy Ai. Condition 2(C) is about 
the design matrices X and Z. Inequality (8) requires noise variables and sig- 
nal variables not highly correlated. The upper bound of (8) depends on the 
ratio p' Xn (0+)/p' Xn (a n /2). Thus, concave penalty functions relax this condi- 
tion when compared to convex penalty functions. We will further discuss 
constraints (7) and (8) in Lemma 1. 

If the above conditions on the proxy matrix are satisfied, then the bias 
caused by using P z is small enough, and the resulting estimate still enjoys 
the model selection oracle property described in the following theorem. 

Theorem 1. Assume that ^/nX n — > oo as n — > oo and \ogd n = o(nA^). 
Then under Conditions 1 and 2, with probability tending to 1 as n — >■ oo, 
there exists a strict local minimizer (3 = (0i,02) T of Q n {(3) which satisfies 

(9) ll3i-A),llloo <^ r (logn) and p 2 = 0. 

Theorem 1 presents the weak o£acle property in the sense of Lv and Fan 
(2009) on the local minimizer of Q(/3). Due to the high dimensionality and 
the concavity of p\(-), the characterization of the global minimizer of Q(/3) 
is a challenging open question. As will be shown in the simulation and real 
data analysis, the concave function Q(f3) will be iteratively minimized by the 
local linear approximation method [Zou and Li (2008)]. Following the same 
idea as in Zou and Li (2008), it can be shown that the resulting estimate 
poesses the properties in Theorem 1 under some conditions. 

2.3. Choice of proxy matrix Ai. It is difficult to see from (7) and (8) on 
how restrictive the conditions on the proxy matrix Ai are. So we further 
discuss these conditions in the lemma below. We introduce the notation 
T = a 2 Q~ l + Z T P X .Z and E = a 2 Q" 1 + Z T Z with P x = I - Xi(XfXi) -1 Xi. 
Correspondingly, when the proxy matrix At is used, define T = Ai^ 1 + 
Z T P X Z and E = Ai' 1 + 7i T 7i. We use || • H2 to denote the matrix 2-norm, 
that is, ||B|| 2 = {A max (BB T )} 1 / 2 for a matrix B. 

Lemma 1. Assume that IK^-XfPxXi) -1 ^ < n~ T y/logn/p' Xn (a n /2) and 

(10) HT-V^TT- 1 ^ - I|| 2 < (1 + ^s\^p\ n {a n /2)\\ZT' l Z T \\ 2 r l . 
Then (7) holds. 

Similarly, assume that ||X2'P 2 Xi(Xj , P 2 Xi) _1 || 00 <p' Xn (0+)/p' Xn (a n /2), 
and there exists a constant c 2 > such that 

||T — — 1/2 j|| 

(11) < [l + n- 1 ||ZT- 1 Z T || 2 
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x max{c 2 n e ,c 1 (logn)s^ 2 A„V An (a n /2)||X| 1 p z Xi|| 2 }] \ 
||E~ 1/2 EE" 1/2 -I|| 2 

(12) < [l + A; 1 (logn) S ^ 2 (logny An (a n /2) 

x ||Z^Z r || 2 {||(XfP^X 1 )- 1 || 2 ||Xi , P^X 2 || 2 } 1 / 2 ]- 1 ) 

then (8) holds. 

Equations (10), (11) and (12) show conditions on the proxy matrix Ai. 
Note that if penalty function used is flat outside of a neighborhood of zero, 
then p' Xn (a n /2) « with appropriately chosen regularization parameter A n , 
and conditions (10) and (12), respectively, reduce to 

(13) IIT-^TT -1 / 2 - I|| 2 < 1, ||E- 1 / 2 EE -1 >' 2 -I|| 2 <1. 

Furthermore, since Z is a block diagonal matrix, if the maximum eigenvalue 
of ZT _1 Z T is of the order o(n 1-61 ), then condition (11) reduces to 

(14) ||T~ 1/2 TT- 1/2 - I|| 2 < 1. 

Conditions (13) and (14) are equivalent to assuming that T" 1 /2TT- 1 /2 anc i 
E~ 1 / 2 EE~ 1 / 2 have eigenvalues bounded between and 2. By linear algebra, 
they can further be reduced to 1 1 T — 1 T 1 1 2 < 2 and 1 1 E~ 1 E 1 1 2 < 2. It is seen 
from the definitions of T, T, E and E that if eigenvalues of ZP X Z T and 
ZZ T dominate those of a 2 Q~ l by a larger order of magnitude, then these 
conditions are not difficult to satisfy. In fact, note that both ZP X Z T and ZZ T 
have components with magnitudes increasing with n, while the components 
of <7 2 £/ _1 are independent of n. Thus as long as both matrices ZP X Z T and 
ZZ T are nonsingular, these conditions will easily be satisfied with the choice 
M = (logn)I when n is large enough. 

3. Identifying important random effects. In this section, we allow the 
number of random effects q to increase with sample size n and write it as q n 
to emphasize its dependency on n. We focus on the case where the number 
of fixed effects d n is smaller than the total sample size n = ^£i n «- We 
discuss the d n >n case in the discussion Section 5. The major goal of this 
section is to select important random effects. 

3.1. Regularized posterior mode estimate. The estimation of random ef- 
fects is different from the estimation of fixed effects, as the vector 7 is ran- 
dom. The empirical Bayes method has been used to estimate the random 
effects vector *y in the literature. See, for example, Box and Tiao (1973), 
Gelman et al. (1995) and Verbeke and Molenberghs (2000). Although the 
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empirical Bayes method is useful in estimating random effects in many situ- 
ations, it cannot be used to select important random effects. Moreover, the 
performance of an empirical Bayes estimate largely depends on the accu- 
racy of estimated fixed effects. These difficulties call for a new proposal for 
random effects selection. 

Patterson and Thompson (1971) propose the error contrast method to 
obtain the restricted maximum likelihood of a linear model. Following their 
notation, define the n x (n — d) matrix A by the conditions AA T = P x and 
A T A = I, where P x = I — X(X T X)~ 1 X T . Then the vector A T s provides a 
particular set of n — d linearly independent error contrasts. Let wi = A T y. 
The following proposition characterizes the conditional distribution of wi: 

Proposition 1. Given -y, the density function of w± takes the form 

(15) / Wl (A T y| 7 ) = (2vra 2 )- (n - d)/2 exp j-^(y - Zj) T P x (y - Z 7 )j. 

The above conditional probability is independent of the fixed effects vector 
(3 and the error contrast matrix A, which allows us to obtain a posterior 
mode estimate of 7 without estimating /3 and calculating A. 

Let 9Ko C {1, 2, . . . , q n } be the index set of the true random effects. Define 

m = = iq n + k, for i = 0, 1,2, . . . ,N - 1 and k 6 9tt } 

and denote by 9Kq = {1, 2, . . . ,Nq n } \ 9Ko- Then 9JIq is the index set of 
nonzero random effects coefficients in the vector 7, and 9Jt * s the index 
set of the zero ones. Let S2n = ||9^o||o be the number of true random effects. 
Then ||9Jto||o = ^ s 2n- We allow Ns2 n to diverge with sample size n, which 
covers both the case where the number of subjects N diverges with n alone 
and the case where N and S2 n diverge with n simultaneously. 

For any S C {1, . . . , q n N}, we use Z5 to denote the (q n N) x |5| submatrix 
of Z formed by columns with indices in S, and *y s to denote the subvector 
of 7 formed by components with indices in S. Then 7^ ~ A^O,^^) with 
Qqji a submatrix formed by entries of Q with row and column indices in 9JIq . 
In view of (15), the restricted posterior density of 7^ can be derived as 

Ui(lW \ AT y) K /wi(A T y|7M )/(^M () ) 

« exp {"^2 (y " Z ™ () 7™ ) T Px(y - Z^ o 7^ o ) - ^7^,^7fflfej- 

Therefore, the restricted posterior mode estimate of 7^ is the solution to 
the following minimization problem: 

(16) min{ (y - Z Mo 7 Mq ) t P x . (y - Z Mo 7 Mq ) + a 2 j^ ^7 Mo } • 
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In practice, since the true random effects 9J?o are unknown, the formula- 
tion (16) does not help us estimate 7. To overcome this difficulty, note that 
Z OTo 7 5W = 27 and 7 !t ^Mo 7 ^o = with ^ + the Moore-Penrose gen- 

eralized inverse of Q. Thus the objective function in (16) is rewritten as 

(y - Z 7 ) T P iE (y - Z 7 ) + oVS + 7. 

which no longer depends on the unknown 93To . Observe that if the /cth ran- 
dom effect is a noise one, then the corresponding standard deviation is 0, 
and the coefficients 7^ for all subjects i = 1, . . . , N should equal to 0. This 
leads us to consider group variable selection strategy to identify true random 
effects. Define 7.^ = (X)£Li n /fk)^ 2 i k = 1, . . . ,q n , and consider the following 
regularization problem: 

1 1 Qn 

(17) -(y - ZjfP x (y - Z 7 ) + -<J 2 1 T Q + 1 + nJ^PxM, 

k=l 

where p\ n (-) is the penalty function with regularization parameter A n > 
0. The penalty function here may be different from the one in Section 2. 
However, to ease the presentation, we use the same notation. 

There are several advantages to estimating the random effects vector 7 
using the above proposed method (17). First, this method does not require 
knowing or estimating the fixed effects vector (3, so it is easy to implement, 
and the estimation error of (3 has no impact on the estimation of 7. In 
addition, by using the group variable selection technique, the true random 
effects can be simultaneously selected and estimated. 

In practice, the covariance matrix Q and the variance a 2 are both un- 
known. Thus, we replace a~ 2 Q with Ai, where Ai = diag{M, . . . , M} with 
M a proxy of G, yielding the following regularization problem: 

1 1 9n 

(18) ££(7) = -(y - Z 7 ) T P,(y - Z 7 ) + -7 T A^~ 1 7 + n^PXn (7-fc)- 

fc=i 

It is interesting to observe that the form of regularization in (18) includes 
the elastic net [Zou and Hastie (2005)] and the adaptive elastic net [Zou and 
Zhang (2009)] as special cases. Furthermore, the optimization algorithm for 
adaptive elastic net can be modified for minimizing (18). 

3.2. Asymptotic properties. Minimizing (18) yields an estimate of 7, de- 
noted by 7. In this subsection, we study the asymptotic property of 7. 
Because 7 is random rather than a deterministic parameter vector, the ex- 
isting formulation for the asymptotic analysis of a regularization problem is 
inapplicable to our setting. Thus, asymptotic analysis of 7 is challenging. 

Let T = Z^Z + a 2 Q + and T = Z T P X Z + M~ l . Denote by Tu = 
Z !to P * Z 2Ko + ^(^r 1 ' T 22 = %^f^ a » d T12 = Z^ o P,Z^. Sim- 
ilarly, we can define submatrices Tu, T22 and T12 by replacing a~ 2 Q with 
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A4. Then it is easy to see that T12 = T12. Notice that if the oracle informa- 
tion of set dJlo is available and Q, and a 2 are known, then the Bayes estimate 
of the true random effects coefficient vector 7^ has the form Z|l P x y. 

Define 7* = ((7*f , ■ • • , (7^) T ) T with 7* = (7*°, . . . , 7 * ?n f for j = 1,°.. . , N 
as the oracle-assisted Bayes estimate of the random effects vector. Then 
T*OTo = ^ anc ^ 7*ot = ^ii 1 ^^)"^ 1 '^' Correspondingly, define 7* as the ora- 
cle Bayses estimate with proxy matrix, that is, 7^ = and 



( 19 ) % = T uZ|P*y- 

For k = 1, ...,q n , let 7* fe = {X^jLi(7jfe) 2 } 1//2 - Throughout we condition on 
the event 

(20) n' = {^' t >VN K } 

with 6q G (0, min e gj{ o^) and cr^ = var^j.). The above event fi* is to ensure 
that the oracle-assisted estimator j*u/'*/N of cifc is not too negatively biased. 

Condition 3. (A) The maximum eigenvalues satisfy A max (ZjGZf ) < 
c^S2n for all i = 1,...,N and the minimum and maximum eigenvalues of 
m~ 1 Z^ P x Z^ o and Ggjt are bounded from below and above by C3 and 

Cg 1 , respectively, with m n = maxi<j<Arnj, where C3 is a positive constant. 
Further, assume that for some S € (0, ^), 



(21) llTn'lloo^ 



„~t ~ i„ Pa (0+) 

22 max Z^P x Zs» T7, L < A "_ — , 

where Z,- is the submatrix formed by the N columns of Z corresponding to 
the jth random effect. 

(B) It holds that snp {t ^ bmP '{ ri (t) = o(N- 1 ). 

(C) The proxy matrix satisfies A m j n (A^ — cr~ 2 G) > 0. 

Condition 3(A) is about the design matrices X, Z and covariance ma- 
trix Q. Since Z™ is a block diagonal matrix and lim sup max ' — < 00, the 

Sno mini rii ' 

components of Z^ P^Z^ have magnitude of the order m n = 0{n/N). 
Thus, it is not very restrictive to assume that the minimum and maxi- 
mum eigenvalues of Z^P x Z^ o are both of the order m n . Condition (22) 
puts an upper bound on the correlation between noise covariates and true 
covariates. The upper bound of (22) depends on the penalty function. Note 
that for concave penalty we have p' x (0+)/p' x (\/iV&o/2) > 1, whereas for 
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L\ penalty p' Xn (0+)/p' Xn (yNbQ/2) = 1. Thus, concave penalty relaxes (22) 
when compared with the L\ penalty. Condition 3(B) is satisfied by many 
commonly used penalties with appropriately chosen A n , for example, L\ 
penalty, SCAD penalty and SICA penalty with small a. Condition 3(C) is 
a restriction on the proxy matrix Ai, which will be further discussed in the 
next subsection. 

Let 7 = (jj, . . . ,7at) T with lj = (ijU ■ ■ • >7jg„) T being an arbitrary (Nq n )- 
vector. Define 7& = (Ylf=i ^jk) 1 ^ 2 f° r eacn k = l,...,q n . Let 
(23) Wl(j) = {k£{l,...,q n }: rk ^0}. 

Theorem 2 below shows that there exists a local minimizer of Qn(l) defined 
in (18) whose support is the same as the true one 9Jto, and that this local 
minimizer is close to the oracle estimator 7*. 



Theorem 2. Assume that Conditions 1 and 3 hold, b^rr/y/N — > 00, 
\og{Nq n ) = o{n 2 \^ l /{Ns2 n fnn)), and n 2 A^/ (Nm n s ^ n ) — >■ 00 as n — >• 00. Then, 
with probability tending to 1, there exists a strict local minimizer 7 £ H Nqn 
°f Qnil) suc h that 



N \ 1/2 

2 { ^ „-5 

. 



Tl (7) = Wl and max J — ^ (j jk ■ 



r- 



where 5 is defined in (21). 



Using a similar argument to that for Theorem 1, we can obtain that 
the dimensionality Nq n is also allowed to grow exponentially with sample 
size n under some growth conditions and with appropriately chosen A n . In 
fact, note that if the sample sizes n\ = ■ ■ ■ = = m n /N, then the growth 
condition in Theorem 2 becomes log(Nq n ) = o(ns^ n 1 A 2 l ). Since the lowest 
signal level in this case is y/Nb^, if 6q is a constant, a reasonable choice of 
tuning parameter would be of the order y/~Nn~ K with some k G (0, ^). For 
S2n = 0(n u ) with v G [0, i) and Nn 1 ~ 2K ~ u — > 00, we obtain that Nq n can 
grow with rate exp(A r n 1_ ). 

3.3. Choice of proxy matrix M . Similarly as for the fixed effects selection 
and estimation, we discuss (21) and (22) in the following lemma. 

Lemma 2. Assume that ||Tr- 1 1 || 00 < ,,%^~ S , > [1 7^ — 1 and 

(24) HT^Tii - I|| 2 < [1 + N /i^i^n 1 ^(>/i?6S/2)||Tri 1 || 2 ]- 1 . 
Then (21) holds. 
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Assume that max jeOT c ||ZjP x Z^j o T n || 2 < —, A ( ^y fe » /2) with z i defined 
in (22) and 

(25) HTuTJi 1 - I|| 2 < 1. 

Then (22) holds. 



Conditions (24) and (25) put restrictions on the proxy matrix A4. Simi- 
larly to the discussions after Lemma 1, if p'^ (VNoq/2) ~ 0, then these condi- 
tions become HTuT^ — 1||2 < 1. If Z^ P x Z^ o dominates o~ 2< 3^ by a larger 
magnitude, then conditions (24) and (25) are not restrictive, and choosing 
M = (logn)I should make these conditions as well as Condition 3(C) satis- 
fied for large enough n. 

We remark that using the proxy matrix Ai = (logn)I is equivalent to 
ignoring correlations among random effects. The idea of using diagonal ma- 
trix as a proxy of covariance matrix has been proposed in other settings 
of high-dimensional statistical inference. For instance, the naive Bayes rule 
(or independence rule), which replaces the full covariance matrix in Fisher's 
discriminant analysis with a diagonal matrix, has been demonstrated to be 
advantageous for high-dimensional classifications both theoretically [Bickcl 
and Levina (2004), Fan and Fan (2008)] and empirically [Dudoit, Fridlyand 
and Speed (2002)]. The intuition is that although ignoring correlations gives 
only a biased estimate of covariance matrix, it avoids the errors caused by 
estimating a large amount of parameters in covariance matrix in high di- 
mensions. Since the accumulated estimation error can be much larger than 
the bias, using diagonal proxy matrix indeed produces better results. 

4. Simulation and application. In this section, we investigate the finite- 
sample performance of the proposed procedures by simulation studies and a 
real data analysis. Throughout, the SCAD penalty with a = 3.7 [Fan and Li 
(2001)] is used. For each simulation study, we randomly simulate 200 data 
sets. Tuning parameter selection plays an important role in regularization 
methods. For fixed effect selection, both AIC- and BIC-selectors [Zhang, Li 
and Tsai (2010)] are used to select the regularization parameter A n in (6). 
Our simulation results clearly indicate that the BIC-selector performs better 
than the AlC-selector for both the SCAD and the LASSO penalties. This 
is consistent with the theoretical analysis in Wang, Li and Tsai (2007). To 
save space, we report the results with the BIC-selector. Furthermore the 
BIC-selector is used for fixed effect selection throughout this section. For 
random effect selection, both AIC- and BIC-selectors are also used to select 
the regularization parameter A n in (18). Our simulation results imply that 
the BIC-selector outperforms the AlC-selector for the LASSO penalty, while 
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the SCAD with AlC-selector performs better than the SCAD with BIC- 
selector. As a result, we use AlC-selector for the SCAD and BIC-selector for 
the LASSO for random effect selection throughout this section. 

Example 1. We compare our method with some existing ones in the 
literature under the same model setting as that in Bondell, Krishna and 
Ghosh (2010), where a joint variable selection method for fixed and random 
effects in linear mixed effects models is proposed. The underlying true model 
takes the following form with q = 4 random effects and d = 9 fixed effects: 

(26) yij = bn + PiXiji + p 2 Xij2 + b i2 Zij 1 + b^z^ + £%j , £y ~u.d. N(0, 1), 

where the true parameter vector /3 = (1, 1,0, . . . , 0) T , the true covariance 
matrix for random effects 



and the covariates for k = 1,...,9 and Zj,-/ for I = 1,2,3 are gener- 
ated independently from a uniform distribution over the interval [—2,2]. 
So there are three true random effects and two true fixed effects. Follow- 
ing Bondell, Krishna and Ghosh (2010), we consider two different sample 
sizes N = 30 subjects and rii = 5 observations per subject, and N = 60 and 
rii = 10. Under this model setting, Bondell, Krishna and Ghosh (2010) com- 
pared their method with various methods in the literature, and simulations 
therein demonstrate that their method outperforms the competing ones. So 
we will only compare our methods with the one in Bondell, Krishna and 
Ghosh (2010). 

In implementation, the proxy matrix is chosen as M = (logn)I. We then 
estimate the fixed effects vector j3 by minimizing Q n ((3), and the random 
effects vector 7 by minimizing (18). To understand the effects of using proxy 
matrix M on the estimated random effects and fixed effects, we compare 
our estimates with the ones obtained by solving regularization problems (6) 
and (17) with the true value a~ 2 Q. 

Table 1 summarizes the results by using our method with the proxy matrix 
M and SCAD penalty (SCAD-P), our method with proxy matrix Ai and 
Lasso penalty (Lasso-P), our method with true a~ 2 Q and SCAD penalty 
(SCAD-T). When SCAD penalty is used, the local linear approximation 
(LLA) method proposed by Zou and Li (2008) is employed to solve these 
regularization problems. The rows "M-ALASSO" in Table 1 correspond to 
the joint estimation method by Bondell, Krishna and Ghosh (2010) using 
BIC to select the tuning parameter. As demonstrated in Bondell, Krishna 
and Ghosh (2010), the BIC-selector outperforms the AlC-selector for M- 
ALASSO. We compare these methods by calculating the percentage of times 
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Table 1 

Fixed and random effects selection in Example 1 
when d = 9 and q — 4 



Setting 


Method 


%CF 


%CR 


AT = 30 


Lasso-P 


51 


19.5 


n, = 5 


SCAD-P 


90 


86 




SCAD-T 


93.5 


99 




M-ALASSO 


73 


79 


A = 60 


Lasso-P 


52 


50.5 


m = 10 


SCAD-P 


100 


100 




SCAD-T 


100 


100 




M-ALASSO 


S3 


89 



the correct fixed effects are selected (%CF), and the percentage of times the 
correct random effects are selected (%CR). Since these two measures were 
also used in Bondell, Krishna and Ghosh (2010), for simplicity and fairness of 
comparison, the results for M-ALASSO in Table 1 are copied from Bondell, 
Krishna and Ghosh (2010). 

It is seen from Table 1 that SCAD-P greatly outperforms Lasso-P and 
M-ALASSO. We also see that when the true covariance matrix a~ 2 Q is 
used, SCAD-T has almost perfect variable selection results. Using the proxy 
matrix makes the results slightly inferior, but the difference vanishes for 
larger sample size N = 60, rij = 10. 

Example 2. In this example, we consider the case where the design 
matrices for fixed and random effects overlap. The sample size is fixed at 
rii = 8 and N = 30, and the numbers for fixed and random effects are chosen 
to be d = 100 and q = 10, respectively. To generate the fixed effects design 
matrix, we first independently generate Xjj from A^(0, £), where S = (cr s t) 
with a s t = /o' s_ *' and p £ (—1,1). Then for the jth observation of the ith 
subject, we set x^ = I{xijk > 0) for covariates k = 1 and d, and set = 
Xijk for all other values of k. Thus 2 out of d covariates are discrete ones and 
the rest are continuous ones. Moreover, all covariates are correlated with each 
other. The covariates for random effects are the same as the corresponding 
ones for fixed effects, that is, for the jth observation of the ith subject, we 
set Zijk = Xijk f° r k = 1, . . . , q = 10. Then the random effect covariates form 
a subset of fixed effect covariates. 

The first six elements of fixed effects vector /3 are (2,0, 1.5,0,0, 1) T , and 
the remaining elements are all zero. The random effects vector 7 is generated 
in the same way as in Example 1. So the first covariate is discrete and 
has both nonzero fixed and random effect. We consider different values of 
correlation level p, as shown in Table 2. We choose A4 = (logra)I. 
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Since the dimension of random effects vector 7 is much larger than the 
total sample size, as suggested at the beginning of Section 2.1, we start 
with the random effects selection by first choosing a relatively small tuning 
parameter A and use our method in Section 3 to select important random 
effects. Then with the selected random effects, we apply our method in 
Section 2 to select fixed effects. To improve the selection results for random 
effects, we further use our method in Section 3 with the newly selected fixed 
effects to reselect random effects. This iterative procedure is applied to both 
Lasso-P and SCAD-P methods. For SCAD-T, since the true a~ 2 G is used, 
it is unnecessary to use the iterative procedure, and we apply our methods 
only once for both fixed and random effects selection and estimation. 

We evaluate each estimate by calculating the relative L2 estimation loss 

RL2(3) = ||3-/3 ||2/||/3oll2, 

where (3 is an estimate of the fixed effects vector /9 . Similarly, the relative 
L\ estimation error of f3, denoted by RLl(/3), can be calculated by replacing 
the L2-norm with the Li-norm. For the random effects estimation, we define 
RL2( ; y) and RLl( ; y) in a similar way by replacing f3 Q with the true 7 in each 
simulation. We calculate the mean values of RL2 and RL1 in the simulations 
and denote them by MRL2 and MRL1 in Table 2. In addition to mean 
relative losses, we also calculate the percentages of missed true covaritates 
(FNR), as well as the percentages of falsely selected noise covariates (FPR), 
to evaluate the performance of proposed methods. 

From Table 2 we see that SCAD-T has almost perfect variable selection 
results for fixed effects, while SCAD-P has highly comparable performance, 
for all three values of correlation level p. Both methods greatly outperform 

Table 2 

Fixed and random effects selection and estimation in Example 2 when m — 8, N — 30, 
d = 100, q — 10 and design matrices for fixed and random effects overlap 



Random effects Fixed effects 







FNR 


FPR 


MRL2 


MRL1 


FNR 


FPR 


MRL2 


MRL1 


Setting 


Method 


(%) 


(%) 






(%) 


(96) 






p = 0.3 


Lasso-P 


11.83 


9.50 


0.532 


0.619 


62.67 


0.41 


0.841 


0.758 




SCAD-P 


0.50 


1.07 


0.298 


0.348 


0.83 


0.03 


0.142 


0.109 




SCAD-T 


3.83 


0.00 


0.522 


0.141 


0.33 


0.02 


0.102 


0.082 


p=-0.3 


Lasso-P 


23.67 


7.64 


0.524 


0.580 


59.17 


0.41 


0.802 


0.745 




SCAD-P 


1.83 


0.71 


0.308 


0.352 


0.67 


0.05 


0.141 


0.109 




SCAD-T 


3.17 


0.00 


0.546 


0.141 


0.17 


0.02 


0.095 


0.078 


p = 0.5 


Lasso-P 


9.83 


10.07 


0.548 


0.631 


60.33 


0.48 


0.844 


0.751 




SCAD-P 


1.67 


0.50 


0.303 


0.346 


0.17 


0.05 


0.138 


0.110 




SCAD-T 


5.00 


0.00 


0.532 


0.149 


0.50 


0.02 


0.113 


0.091 
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the Lasso-P method. For the random effects selection, both SCAD-P and 
SCAD-T perform very well with SCAD-T having slightly larger false nega- 
tive rates. We remark that the superior performance of SCAD-P is partially 
because of the iterative procedure. In these high-dimensional settings, di- 
rectly applying our random effects selection method in Section 3 produces 
slightly inferior results to the ones for SCAD-T in Table 2, but iterating 
once improves the results. We also see that as the correlation level increases, 
the performance of all methods become worse, but the SCAD-P is still com- 
parable to SCAD-T, and both perform very well in all settings. 

Example 3. We illustrate our new procedures through an empirical 
analysis of a subset of data collected in the Multi-center AIDs Cohort Study. 
Details of the study design, method and medical implications have been 
given by Kaslow et al. (1987). This data set comprises the human immun- 
odeficiency virus (HIV) status of 284 homosexual men who were infected 
with HIV during the follow-up period between 1984 and 1991. All patients 
are scheduled to take measurements semiannually. However, due to the miss- 
ing of scheduled visits and the random occurrence of HIV infections, there 
are an unequal number of measurements and different measurement times 
for each patients. The total number of observations is 1765. 

Of interest is to investigate the relation between the mean CD4 percent- 
age after the infection (y) and predictors smoking status (x\, 1 for smoker 
and for nonsmoker), age at infection (X2), and pre-HIV infection CD4 
percentage (Pre-CD4 for short, X3). To account for the effect of time, we 
use a five-dimensional cubic spline h(t) = (bi(t),b2{t), . . . ,b^(t)) T . We take 
into account the two-way interactions b(tij)xis, xnXi2, xnXiz and Xi2Xi$. 
These eight interactions together with variables h(tij), xn, Xi2 and give 
us 16 variables in total. We use these 16 variables together with an intercept 
to fit a mixed effects model with dimensions for fixed and random effects 
d = q = 17. The estimation results are listed in Table 3 with rows "Fixed" 
showing the estimated /3j's for fixed effects, and rows "Random" showing 
the estimates j.k/^/N. The standard error for the null model is 11.45, and it 



Table 3 

The estimated coefficients of fixed and random, effects in Example 3 





Intercept 


M*) 


M*) 


M*) 


M*) 


M*) 


Xl 


x 2 


x 3 


Fixed 


29.28 


9.56 


5.75 





-8.32 





4.95 








Random 































bi(t)cc 3 


b 2 (t)x 3 


b 3 (t)x 3 




b 5 (t)x 3 


XxX 2 


X1X3 


3:2333 




Fixed 




























Random 


0.163 


0.153 


0.057 


0.043 


0.059 





0.028 


0.055 
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0.6 



0.4 



0.2 



-0.2 



-0.4 



i 



b1(t)x3 b2(t)x3 b3(t)x3 b4(t)x3 b5(t)x3 



x1x3 



x2x3 



Fig. 1. Boxplots of selected random effects. From left to right: bi(t)xs, £ = 1,2,..., 5, 
X1X3, X2X3, where xi is the smoking status, X2 is the age at infection, X3 is Pre-CD4 level 
and bi(t) 's are cubic spline basis functions of time. 



reduces to 3.76 for the selected model. Prom Table 3, it can be seen that the 
baseline has time-variant fixed effect and Pre-CD4 has time-variant random 
effect. Smoking has fixed effect while age and Pre-CD4 have no fixed effects. 
The interactions smoking x Pre-CD4 and age x Pre-CD4 have random ef- 
fects with smallest standard deviations among selected random effects. The 
boxplots of the selected random effects are shown in Figure 1. 

Our results have close connections with the ones in Huang, Wu and Zhou 
(2002) and Qu and Li (2006), where the former used bootstrap approach to 
test the significance of variables and the later proposed hypothesis test based 
on penalized spline and quadratic inference function approaches, for varying- 
coefficient models. Both papers revealed significant evidence for time- varying 
baseline, which is consistent with our discovery that basis functions bj(t)'s 
have nonzero fixed effect coefficients. At 5% level, Huang, Wu and Zhou 
(2002) failed to reject the hypothesis of constant Pre-CD4 effect (p- value 
0.059), while Qu and Li's (2006) test was weakly significant with p- value 
0.045. Our results show that Pre-CD4 has constant fixed effect and time- 
varying random effect, which may provide an explanation on the small dif- 
ference of p-values in Huang, Wu and Zhou (2002) and Qu and Li (2006). 

To further access the significance of selected fixed effects, we refit the 
linear mixed effects model with selected fixed and random effects using the 
Matlab function "nlmefit." Based on the i-statistics from the refitted model, 
the intercept, the baseline functions b\(t) and are all highly significant 
with i-statistics much larger than 7, while the t-statistics for 64 (i) and x\ 
(smoking) are —1.026 and 2.216, respectively. This indicates that b±(t) is 
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insignificant, and smoking is only weakly significant at 5% significance level. 
This result is different from those in Huang, Wu and Zhou (2002) and Qu 
and Li (2006), where neither paper found significant evidence for smoking. 
A possible explanation is that by taking into account random effects and 
variable selection, our method has better discovery power. 

5. Discussion. We have discussed the selection and estimation of fixed 
effects in Section 2, providing that the random effects vector has nonsingular 
covariance matrix, while we have discussed the selection of random effects in 
Section 3, providing that the dimension of fixed effects vector is smaller than 
the sample size. We have also illustrated our methods with numerical studies. 
In practical implementation, the dimensions of the random effects vector and 
fixed effects vector can be both much larger than the total sample size. In 
such case, we suggest an iterative way to select and estimate the fixed and 
random effects. Specifically, we can first start with the fixed effects selection 
using the penalized least squares by ignoring all random effects to reduce the 
number of fixed effects to below sample size. Then in the second step, with 
the selected fixed effects, we can apply our new method in Section 3 to select 
important random effects. Third, with the selected random effects from the 
second step, we can use our method in Section 2 to further select important 
fixed effects. We can also iterate the second and third steps several times to 
improve the model selection and estimation results. 

6. Proofs. Lemma 3 is proved in the supplemental article Fan and Li 
(2012). 

Lemma 3. It holds that 

p 2 = (i - zB z fn~ l (i - zb z ) + B^g~ l B z = (n + zgz T y\ 

6.1. Proof of Theorem 1. Let Nq = {(3 = (/3f , f3%) T : \\(3 l - 00,1 Hoc < 
n _T (logn), (5 2 = G R rfn_Sln }. We are going to show that under Condi- 
tions 1 and 2, there exists a strict local minimizer (3 G A/"o of Q n {j3) with 
asymptotic probability one. 

For a vector (3 = (/Si, . . . ,(3 p ) T , let p' Xn (f3) be a vector of the same length 
whose jth component is p'\ n (\Pj\) sgn(/3j), j = l,...,d n . By Lv and Fan 

(2009), the sufficient conditions for 3 = (3i\0 T ) T G K dn with 3 a G R Sl ™ 
being a strict local minimizer of Q n ((3) are 

(27) -XfP z (y-X 1 3 1 )+np / A j3 1 ) = 0, 

(28) HV2IU W An (0+), 

(29) A min (XfP z X 1 )>-np / ^(|^|), j = l,...,s ln , 

where V2 = X 2 n P z (y — Xi/CJjJ. So we only need to show that with probability 
tending to 1, there exists a j3 G A/q satisfying conditions (27)-(29). 
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We first consider (27). Since y = Xi/3 i + Z7 + e, equation (27) can be 
rewritten as 

(30) 3i - 0o,i = (Xf P.Xi^Xf P Z (Z 7 + e) - n(Xf P,Xx)- Va„(3i)- 
Define a vector-valued continuous function 

5 (/3i) =/3i - /3 ,i " (XfP,X 1 )- 1 X? , P z (Z7 + e) + n(Xf P z Xi)" Va^i) 
with /3 1 £ R Sln . It suffices to show that with probability tending to 1, there 
exists (3 = (0i,02) T £ ftfo such that g{P\) = 0. To this end, first note that 

(xfp,Xi)- 1 x?'p,(Z7 + e) ~ jv(o, (xf p^y 1 *? p.p-^^x^xf P.Xx)- 1 ). 

By Condition 2(B), the matrix ciP^-P z P- 1 P 2 =P 2 Z(ci^-cr- 2 ^)Z T P z > 
0, where ^4 > means the matrix A is positive semi-definite. Therefore, 

(31) V = (Xf P 2 X 1 )- 1 XfP z P; 1 P a Xi(XfP z X 1 )- 1 < ciCXfPsXi)- 1 . 

Thus, the jth diagonal component of matrix V in (31) is bounded from 
above by the jth diagonal component of ci(X^P 2 Xi) -1 . Further note that 
by Condition 2(B), P" 1 - ciQognJPj 1 < Z(M - c 1 ^^G)Z T < 0. Recall 
that by linear algebra, if two positive definite matrices A and B satisfy 
A > B, then it follows from^the Woodbury formula that A -1 < I? . Thus, 
(cilogn)P z > P 2 and (XfP^Xi)" 1 < ( Ci logn)(Xf P^Xx) -1 . So by Condi- 
tion 2(C), the diagonal components of V in (31) are bounded from above 
by 0(n~ e (logn)). This indicates that the variance of each component of the 
normal random vector (X^P z Xi)~ 1 X^P 2 (Z7 + e) is bounded from above 
by 0{n~ e ilogn)). Hence, by Condition 2(C), 

|| (Xf P Z X! )^ 1 X 1 P Z (Z 7 + e)\\ oa = O v {n~»l V(log n) (log s ln )) 

(32) 

= o p (n r (logn)). 

Next, by Condition 2(A), for any (3 = (/3i, . . . , j3d n ) T £ M) an d large enough n, 
we can obtain that 

(33) |/%|>|A)jM/3oj-/%|>a n /2, j = l,...,s ln . 

Since p' x (x) is a decreasing function in (0,oo), we have \\p? x (fli)\\oo < 
Px (°n/2)- This together with Condition 2(C) ensures that 

IKxfp^rX^oiL < IKxTp.xo^ilhp^gsoIL 

(34) 

< o(n T 1 (logn)). 

Combining (32) and (34) ensures that with probability tending to 1, if n is 
large enough, 

||(Xi r P,Xi)- 1 XiP a (Z 7 + e) + n(Xf P 2 Xi)- Va(/3i)IL < ^(logn). 
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Applying Miranda's existence theorem [Vrahatis (1989)] to the function 

g(/3i) ensures that there exists a vector (3 1 € R Sln satisfying \\/3 1 — /3 jJIqq < 

n~ r logn such that g((3i) = 0. 

Now we prove that the solution to (27) satisfies (28). Plugging y = Xi/3 l + 
Z7 + e into v in (28) and by (30), we obtain that 

v 2 = XfP 2 Xi(A),i - 3i) + x£P 2 (Z 7 + e) = v 2 ,i + v 2 , 2 , 

where v 2 ,i = J-Xf P*Xi(Xj PsXi^Xf P* + X^P Z ](Z 7 + e) and v 2>2 = 

X^P 2 Xi(XfP 2 X 1 )- 1 p A „(3i)- Since (Z7 + e) ~ N (0, Pj 1 ), it is easy to see 
that v 2j i has normal distribution with mean and variance 

Xj (I - P 2 Xx(Xf P^X!)- x Xf )P Z P; 1 P Z (I - XiCX^Xi^XfPsjXa. 

Since P" 1 < ciP" 1 , I — P^X^X^PjiXi^X^P?/ 2 is a projection matrix, 
and P z has eigenvalues less than 1, it follows that for the unit vector ej., 

e^var(v 2il )e fc < ciefX^P, - P z Xi(Xf P^Xi^X? 1 P,)X 2 e fc 
< cie^ X 2 P z X 2 e fe < e£ XjX 2 e fc = cm, 

where in the the last step, each column of X is standardized to have L 2 -norm 
yfn. Thus the diagonal elements of the covariance matrix of vi 2 are bounded 
from above by c\n. Therefore, for some large enough constant C > 0, 

P(||v 2 ,i||oo > V / 2Cnlogd n ) < (d n - s ln )P(\N(0, Cl n)\ > ^/2Cn\ogd n ) 

= (d n ~ sin)exp(-q 1 Clogd n ) -)• 0. 
Thus, it follows from the assumption log<i n , = o(n\^ l ) that 
||v 2 ,i||oo = O v (^Jn log d n ) = o p (np' Xn (0+)). 
Moreover, by Conditions 2(B) and (C), 

||v 2| 2||oo ^nHX^XiCXfP^Xi)- 1 !!^^) <n/ An (0+). 

Therefore inequality (28) holds with probability tending to 1 as n — > 00. 

Finally we prove that (3 £ Mq satisfying (27) and (28) also makes (29) 
hold with probability tending to 1. By (33) and Condition 2(A), 

0<-np'{ n m)<-n sup p'{ n (t) = o(n^). 

t>a„/2 

On the other hand, by Condition 2(C), A mm (X.JP z ~Ki) > cqu 8 . Since 9 > 2r, 
inequality (34) holds with probability tending to 1 as n — > 00. 

Combing the above results, we have shown that with probability tending 
to 1 as n — > 00, there exists (3 £ A/"o which is a strict local minimizer of 
Q n ((3). This completes the proof. 
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6.2. Proof of Theorem 2. Let 7 = (7^ , . . . , ^ T N ) T £ R 9 ' lAr with 7J = ( 7jl , 

■ • • ijjqn) be a R^^-vector satisfying 9^(7) = 9Jto- Define 11(7) = (uj , . . . , 
u^)^ £ K Nqn with Uj = (uji, Uj qn ) T , where for j = 1, . . . , TV, 

(35) AnUjfc = p' Xn (rf.khjkh-k if A; £ 2>t(7) 

and A n Uj fc = if A; ^ 971(7). Here, 7.^ = {Y^f=i I'jkY^ 2 - Let 7* be the oracle- 
assisted estimate defined in (19). By Lv and Fan (2009), the sufficient con- 
ditions for 7 with 7rj^c = being a strict local minimizer of (18) are 

( 36 ) " 7 ?«o = nA nTnu(7^ ), 

/ N \ V 2 

(37) \J^ k ) <np ' A " (0+) ' fcG9 ^' 

(38) Amin(Tn)>nA max ^-^-^p An (7. fe )^, 

where w(7) = (wf , . . . , w^-) T £ R^" with Wj = (lUji, . . . , Wj gn ) T , and 

(39) w( 7 ) = Z T P :c (y-Z7)-M- 1 7. 

We will show that, under Conditions 1 and 3, conditions (36)-(38) above 
are satisfied with probability tending to 1 in a small neighborhood of 7*. 

In general, it is not always guaranteed that (36) has a solution. We first 
show that under Condition 3, there exists a vector 7* with 971(7*) = 97To 
such that 7*^ makes (36) hold. To this end, we constrain the objective 

function Q* n ("y) defined in (18) on the (7Vs n 2)-dimensional subspace B = 
{7 £ Si qnN : 7^ = 0} of K qnN . Next define 

M = | 7 £ B: max j £( 7jfe - 7 * fc ) 2 j < v 7 ^ 5 j. 

For any 7 = ( 7 n, . . . ,7i 5n , . . . ,7tvi, • • • , lNq n ) T G M and fc £ 97t , we have 

1/2 



|7-7*lloo < max. 



(40) % 



'AT ^ 1 

^(7^-7lfc) 2 | 





I/ 2 ( N } V 2 



< VNn~ s + j. k . 

Note that by Condition 3(C), we have TTj 1 > TTj 1 . Thus it can be derived 
using linear algebra and the definitions of 7 * fc and 7* fc that 7 * fc > 7 * fe . Since 
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we condition on the event f2* in (20), it is seen that for large enough n, 
(41) 7-fe > 7*fe - VNn~ S > y* k - VNn~ S > VNb* Q /2 

for k E QJto and 7 E A/i- Thus, in view of the definition of 11(7) in (35), for 
k E OJTq , we have 



II Vi(7M () )lloc < maxp' An (7. fc ) <Pa„(V^o/2), 

where in the last step, p'\ n (t) is decreasing in i E (0, 00) due to the concavity 
of p\ n (t). This together with (21) in Condition 3 ensures 

(42) ||nA n TrM733o)IL < n||T^^J^\T&5/2) < ^Nn~ 5 . 

Now define the vector- valued continuous function ^ (£)=£— 7^— nAnTT^ u(£), 
with £ a R Ws2 " -vector. Combining (40) and (42) and applying Miranda's 
existence theorem [Vrahatis (1989)] to the function ^(£), we conclude that 
there exists 7* E M\ such that 7^ is a solution to equation (36). 

We next show that 7* defined above indeed satisfies (38). Note that for 
any vector 

d 2 xx T ( 1 xx T \ 

(43) ^A„(||X|| 2 ) =^(11x113)^+^(11x113)^ - W J . 

Since —p' x (t) < and —p'^(t) > for t E (0, 00), we have 



An»(-^PAn(IW| a )) < -PL(I|X|| 2 ) + 



2 X 2 



VLdlxlb)- 



Since 7^ E M\, by (41) we have 7* fe > VNbl/2 for E 9JT - It follows from 
the above inequality and Condition 3(B) that with probability tending to 1, 
the maximum eigenvalue of the matrix — -A — (S?=iPA n (7*fc)) is less than 

m %?(-P\ n (%)) = °( iV ~ 1 ) = o(m n /n). 

Further, by Condition 3(A), ±A min (T n ) = iA^Zg^Z^) > c 3 ^f . Thus 
the maximum eigenvalue of the matrix — -A — (Yl'j=iP\n(l*k)) is ^ ess than 

n~ 1 A m ; n (Tii) with asymptotic probability 1, and (38) holds for 7*. 

It remains to show that 7* satisfies (37). Let v = 7* — 7*. Since 7* is a 

solution to (36), we have v = nA n T7j 11(7^). In view of (39), we have 
w (%) = ( Z Wf ~ ^Tr/Zj: )P x y + Tf 2 % o 
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(44) = (Zgjg - T^Z^P^Zt + e) + tT ~_ 
= Wl + w 2 . 

Since Z7 + e ~ N (0, P" 1 ), we obtain that wi~iV(0,H) with 

H = ( Z llo ~ T 12 T n Z it ) P a : Pz lp a;( Z OTo ~ Z OTo T n T 12)- 
Note that Z^c is a block diagonal matrix, and the zth block matrix has size 

ni x (q n — s 2n ). By Condition 3(A), it is easy to see that A max (ZC?Z T ) < 
maxi< i < iV A max (Z i GZf ) < Cl s 2n . Thus, P.P^P, = P*(<t 2 I + ZgZ T )P x < 
(a 2 + cis 2n )P 2 = C " 2 + ciS2n)Px- Further, it follows from Ti 2 = T 12 and 
Zj o P,Z Mo <T n that 

H < {a 2 + c lS2n )(Z^c - T^T^Zg^P^Z^ - Z^T^Tu) 
= (o- 2 + ClS 2n ) 

x (ZljcP^Z^c +Tf 2 T5" 1 1 Z|j o P :); Z Mo T^ 1 1 Ti2 - 2Z| f cP a ,Z Mo T7 1 Ti2) 

< (<x 2 + c lS2n )(Z^P x Z M e _ T^T^Tia) < (a 2 + Cl s 2n )Z^P x Z M c. 

Thus, the ith diagonal element of H is bounded from above by the ith 
diagonal element of (cr 2 + cis 2n )Z^- c P x Z™c i and is thus bounded by cis 2n m n 

with c\ some positive constant. Therefore by the normality of wi we have 

1 /2 

P(\ 

< N(q n - s 2n )P(\N (0,cis 2n m n )\ > {2c 1 s 2n ?n n log(N(q n - s 2n ))} ) 
= 0((log(N(q n -s 2n ))r 1/2 ) = o(l). 
Therefore, ||wi = o p ({s 2n m n log(iV(g n - s 2n )} 1 / 2 ) = o p (nA^ _1 / 2 A n ) and 

( N \ 1/2 

(45) max|^wi jfc > < VN\\^i\\oo = o p (n\ n ) = o p (l)np' Xn (0+), 
J>S2n [ k =i ' J 

where is the ((j — l)g n + fc)th element of A?g n -vector wi. 

Now we consider w 2 . Define Z,- as the submatrix of Z formed by columns 
corresponding to the jth random effect. Then, for each j = s 2n + 1, . . . , q n , 
by Condition 3(A) we obtain that 

f N \ 1/2 

|E^4 =nA„||ZjP x Z^ o T n 1 u(7^ o )|| 2 

^nllAnU^^ll^lZjP^Z^T^ 1 ^, 
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where W2,jk is the ((j — l)q n + k)th element of iVg n -vector W2. Since 7^ G Mi, 
by (35), (41) and the decreasing property of p'\ n (-) we have ||A n u(7^ () )||2 < 
p' Xn (VNb* /2). By (22) in Condition 3(A), 



Combing the above result for W2 with (44) and (45), we have shown that (37) 
holds with asymptotic probability one. This completes the proof. 



Supplement to "Variable selection in linear mixed effects models" (DOI: 
10.1214/12-AOS1028SUPP; .pdf). We included additional simulation exam- 
ples and technical proofs omitted from the main text: simulation Exam- 
ples A.1-A.3, and technical proofs of Lemmas 1-3 and Proposition 1. 
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